Chatbot dialogue system:
- Generative approach (not the focus of this paper)
- Retrieval approach: better control over response quality than generative approaches. Selects outputs from a whitelist of candidate responses.
Challenge:
- Inference speed
- Whitelist selection and the associated retrieval evaluation: the standard recall metric is over-simplified.
This paper uses SRU instead of LSTM. SRU is faster in both training and inference.
2 inputs:
- context c: concatenation of all utterances in the conversation. We use special tokens to indicate whether each utterance comes from the customer or the agent.
- candidate response r.
1 output:
A score s(c, r) indicating the relevance of the response to the context.
Core of the model: 2 neural encoders fc and fr that encode the context and the response, respectively. They have identical architectures but separate weights.
Input of each encoder: a token sequence w = w1, w2, ..., wn (either the context or a response).
Use fastText as the word embedding method due to the prevalence of typos in both user and agent utterances in real chats.
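The typo robustness comes from fastText representing each word as a bag of character n-grams, so a misspelling still shares most subword units with the correct spelling. A minimal sketch of the n-gram extraction (the boundary markers and 3-6 n-gram range follow fastText's defaults; the overlap measure here is just for illustration):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with fastText-style '<' '>' boundary markers."""
    w = f"<{word}>"
    return {w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

# A typo shares most subword units with the correct spelling, so its
# embedding (a sum of n-gram vectors) stays close to the correct word's.
a = char_ngrams("shipping")
b = char_ngrams("shippng")  # typo
overlap = len(a & b) / len(a | b)  # Jaccard similarity of n-gram sets
```

A purely word-level embedding would map the typo to an out-of-vocabulary token and lose all similarity.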
Each encoder consists of 1 recurrent neural network and 1 multi-headed attention layer.
Recurrent neural network: multi-layer, bidirectional SRUs. Each SRU layer involves the following computation (⊙ denotes elementwise multiplication):

f_t = σ(W_f x_t + v_f ⊙ c_{t−1} + b_f)
c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t−1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t

where σ is the sigmoid activation function, W, W_f, W_r ∈ R^{d_h × d_e} are learned matrices, and v_f, v_r, b_f, b_r ∈ R^{d_h} are learned parameters.
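The speed advantage of SRU over LSTM comes from the fact that all matrix multiplications depend only on the input, so they can be computed for the whole sequence at once; only a cheap elementwise recurrence remains sequential. A single-direction NumPy sketch (the real model stacks bidirectional layers; parameter names mirror the equations, and the highway connection assumes d_e == d_h):

```python
import numpy as np

def sru_layer(x, W, Wf, Wr, vf, vr, bf, br):
    """One SRU layer, single direction. x: (n, d_e) input sequence."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # All input projections are batched up front -- no step-to-step dependence.
    u, uf, ur = x @ W.T, x @ Wf.T, x @ Wr.T      # each (n, d_h)
    c = np.zeros(W.shape[0])
    hs = []
    for t in range(len(x)):
        f = sigmoid(uf[t] + vf * c + bf)          # forget gate
        r = sigmoid(ur[t] + vr * c + br)          # reset gate
        c = f * c + (1.0 - f) * u[t]              # elementwise state update
        hs.append(r * c + (1.0 - r) * x[t])       # highway connection to input
    return np.stack(hs)
```

Only the elementwise loop over c is inherently sequential, which is what makes SRU much faster than an LSTM of the same size, whose gates all depend on the full previous hidden state.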
The multi-headed attention layer compresses the encoded sequence h = h1, h2, ..., hn into a single vector. For each attention head i, attention weights are generated with the following computation:

α^(i) = softmax(σ(h^T W_a^(i)) v_a^(i))

where σ is a non-linear activation function, W_a^(i) ∈ R^{d_h × d_a} is a learned parameter matrix and v_a^(i) ∈ R^{d_a} is a learned parameter vector.
The encoded sequence representation is then pooled to a single vector for each attention head i by summing the attended representations: ĥ^(i) = Σ_t α_t^(i) h_t. Finally, the pooled encodings are averaged across the n_h attention heads: ĥ = (1/n_h) Σ_i ĥ^(i).
The output of the encoder function is the vector f(w) = ĥ.
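The pooling step can be sketched as follows (a NumPy illustration; tanh stands in for the unspecified non-linearity σ, and the per-head parameters are passed as lists):

```python
import numpy as np

def attention_pool(h, Wa_list, va_list):
    """Multi-headed attention pooling: h (n, d_h) -> single vector (d_h,)."""
    pooled = []
    for Wa, va in zip(Wa_list, va_list):       # one (Wa, va) pair per head
        scores = np.tanh(h @ Wa) @ va          # (n,) unnormalized weights
        scores -= scores.max()                 # for numerical stability
        alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over positions
        pooled.append(alpha @ h)               # weighted sum of the h_t
    return np.mean(pooled, axis=0)             # average across the n_h heads
```

Each head produces a convex combination of the timestep encodings, so the final vector stays in the span of the sequence representations regardless of input length.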
To determine the relevance of a response r to a context c, the model computes a matching scoring between the context encoding fc(c) and the response encoding fr(r). This score is simply the dot product of the encodings:
s(c, r) = f_c(c) · f_r(r)
We optimize the model to maximize the score between the context c and the response r+ actually sent by the agent, while minimizing the score between the context and each of k random "negative" responses r1−, ..., rk−. Although negative responses could be sampled separately for each context-response pair, we instead share one set of negative responses across all examples in a batch. This reduces the number of responses that need to be encoded in a batch of size b from O(kb) to O(k + b).
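A sketch of this training objective, assuming a softmax cross-entropy loss over the positive and the k shared negatives (the source does not specify the exact loss, so this is one common choice; encoder outputs are taken as given):

```python
import numpy as np

def shared_negative_loss(Fc, Fr_pos, Fr_neg):
    """Fc: (b, d) context encodings; Fr_pos: (b, d) positive response
    encodings; Fr_neg: (k, d) negatives shared by the whole batch.
    Only b + k responses need encoding, instead of b * (k + 1)."""
    pos = np.sum(Fc * Fr_pos, axis=1, keepdims=True)   # (b, 1) dot products
    neg = Fc @ Fr_neg.T                                # (b, k) via one matmul
    logits = np.concatenate([pos, neg], axis=1)        # (b, k + 1)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()   # the true response sits in column 0
```

Because the negatives are shared, the (b, k) negative-score block is a single matrix product against one encoded set, which is where the O(kb) → O(k + b) encoding saving shows up.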
2 methods of creating a whitelist from which our model selects responses at inference time:
- Frequency-based method: select the 1,000 or 10,000 most common agent responses.
- Clustering-based method: encode all agent responses using the response encoder fr and use k-means clustering with k = 1,000 or k = 10,000 to cluster the responses. Then select the most common response from each cluster to create the whitelist.